ICCS Summer School 2025
Working definition:
A computing resource that is larger than can be provided by one laptop or server
One of the most performant computers in the world at a particular point in time.
An architecture for combining a number of servers, storage and networking to act in concert.
Most supercomputers for the past few decades have been clusters.
Why would I need a supercomputer?
Three traditional applications:
Now, AI
Computer math is not people math
>>> 0.1 + 0.2
0.30000000000000004
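A minimal sketch of what this means in practice, using only the Python standard library. It shows that exact equality tests on floats are fragile, that a tolerance-based comparison is safer, and that the order of additions can change the result. The tolerance value is an illustrative assumption, not a recommendation.

```python
import math
from fractions import Fraction

total = 0.1 + 0.2

# Exact equality fails: 0.1 and 0.2 have no exact binary representation.
exact_equal = (total == 0.3)

# Compare with a tolerance instead (rel_tol chosen for illustration only).
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)

# Floating point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# Fraction reveals the exact value stored for the literal 0.1.
stored = Fraction(0.1)

print(exact_equal)                 # False
print(close_enough)                # True
print(left == right)               # False
print(stored == Fraction(1, 10))   # False
```

The non-associativity matters on a cluster: a parallel sum may add numbers in a different order from a serial one, so results can differ in the last few digits.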
One FLOPS == one floating point operation per second.
Conventionally these are 64-bit (“double precision”) FLOPS
Image source: Felix LeClair
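As a rough illustration of how FLOPS figures scale, the sketch below estimates a theoretical peak as nodes × cores × clock × floating point operations per cycle. The machine described (100 nodes, 64 cores, 2 GHz, 16 operations per cycle) is entirely hypothetical, and the function names are made up for this example. Real applications usually achieve well below the theoretical peak.

```python
def peak_flops(nodes, cores_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retires flops_per_cycle operations each cycle."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle


# SI prefixes, largest first, so the first match is the most readable unit.
PREFIXES = [
    (1e18, "EFLOPS"),
    (1e15, "PFLOPS"),
    (1e12, "TFLOPS"),
    (1e9, "GFLOPS"),
    (1e6, "MFLOPS"),
]


def format_flops(value):
    """Render a FLOPS value with the largest prefix that fits."""
    for scale, unit in PREFIXES:
        if value >= scale:
            return f"{value / scale:.2f} {unit}"
    return f"{value:.0f} FLOPS"


# Hypothetical cluster, not a real system.
example = peak_flops(nodes=100, cores_per_node=64, clock_hz=2.0e9, flops_per_cycle=16)
print(format_flops(example))  # 204.80 TFLOPS
```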
A benchmark is a particular known and specified workload which can be repeated on different systems and the performance compared.
A typical weather related one is WRF running the CONUS 2.5km configuration.
LINPACK is a software library for performing numerical linear algebra
LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.
The LINPACK benchmarks appeared initially as part of the LINPACK user’s manual. The parallel LINPACK benchmark implementation called HPL (High Performance Linpack) is used to benchmark and rank supercomputers for the TOP500 list.
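To make the idea concrete, here is a toy LINPACK-style measurement in pure Python: solve a dense random system by Gaussian elimination with partial pivoting, time it, and convert the leading-order operation count (about 2/3 n³) into a rate. This is a sketch only, not HPL. The problem size, the random seed and the `solve` helper are assumptions for illustration, and pure Python will report a rate far below what an optimised BLAS achieves on the same hardware.

```python
import random
import time


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (works on copies)."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest entry in column k to the diagonal.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


n = 120  # small on purpose; real benchmark runs use far larger systems
rng = random.Random(0)
a = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
b = [rng.uniform(-1, 1) for _ in range(n)]

start = time.perf_counter()
x = solve(a, b)
elapsed = time.perf_counter() - start

# Check the answer: largest absolute entry of (a x - b).
residual = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))

flop_count = (2 / 3) * n**3  # leading-order operation count only
print(f"n={n} residual={residual:.2e} rate={flop_count / elapsed / 1e6:.1f} MFLOPS")
```

A benchmark is only meaningful if the answer is also checked, which is why the residual is reported alongside the rate.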
Go to the TOP500 site at https://top500.org/
Before we get to the computing infrastructure, there is the building and plant (power, cooling) that underpins it.
The name comes from the terminology of mathematical graphs - nodes and edges.
You can think of a node as a single server: one computer that runs an instance of an operating system.
These are your entry point on to the cluster
Usually accessible from the outside world.
Often more than one (sometimes multiple login nodes share the same DNS name, e.g. login.hpc.cam.ac.uk)
Shared with multiple users.
DO NOT RUN COMPUTE JOBS ON THE LOGIN NODE
These are the nodes that do the heavy lifting computing work.
Normally managed by the job scheduler - you don’t usually log in to them directly.
Quite often for the exclusive use of one user for the duration of their job.
N.B. On some clusters compute nodes can be of a different architecture to the login nodes.
Compute nodes sometimes have on-node disk storage.
There is normally some large storage that is visible to all the compute nodes.
Since this is a shared resource, an anti-social user can degrade performance for other users.
Connects the compute nodes, login nodes and storage
Usually faster (higher bandwidth, lower latency) than commodity Ethernet networking.
It’s what makes a supercomputer super.
Examples:
- InfiniBand
- Omni-Path
- Slingshot
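Why both bandwidth and latency matter can be seen with a simple cost model: time to send a message ≈ latency + size / bandwidth. The sketch below uses round, made-up numbers for a "slow" and a "fast" network. They are assumptions for illustration, not measurements of Ethernet or of any of the interconnects listed above.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple cost model: fixed start-up latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Illustrative round numbers only (assumptions, not vendor figures).
slow = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 1.25e8}
fast = {"latency_s": 1e-6, "bandwidth_bytes_per_s": 2.5e10}

for size in (8, 1_000_000):
    t_slow = transfer_time(size, **slow)
    t_fast = transfer_time(size, **fast)
    print(f"{size:>9} bytes: slow {t_slow * 1e6:10.2f} us, fast {t_fast * 1e6:8.2f} us")
```

For the 8-byte message the time is almost entirely latency, and for the 1 MB message it is mostly bandwidth. Tightly coupled parallel codes send many small messages, so low latency is as important as high bandwidth.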
The scheduler takes requests to run jobs with particular cluster resources, fits these in around other users’ jobs according to some policy, launches each job, terminates it if it overruns, and does the accounting.
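A toy model of "fitting jobs in according to some policy" is sketched below. It assumes the simplest possible policy, strict first-come-first-served on a cluster with a fixed number of nodes. It is not how Slurm or any real scheduler is implemented, and the job names, sizes and runtimes are made up.

```python
import heapq


def fifo_schedule(jobs, total_nodes):
    """Return {job name: start time} for strict first-come-first-served scheduling.

    jobs is a list of (name, nodes, runtime) in submission order.
    """
    running = []  # min-heap of (end_time, nodes) for jobs currently holding nodes
    free = total_nodes
    now = 0
    starts = {}
    for name, nodes, runtime in jobs:
        if nodes > total_nodes:
            raise ValueError(f"job {name} asks for more nodes than the cluster has")
        # Wait for running jobs to finish until enough nodes are free.
        while free < nodes:
            end_time, released = heapq.heappop(running)
            now = max(now, end_time)
            free += released
        starts[name] = now
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))
    return starts


# Hypothetical workload on a 4-node cluster: (name, nodes, runtime).
jobs = [("a", 2, 10), ("b", 2, 5), ("c", 3, 4), ("d", 1, 1)]
print(fifo_schedule(jobs, total_nodes=4))  # {'a': 0, 'b': 0, 'c': 10, 'd': 10}
```

Job d needs only one node and could have started at time 5 when b finished, but strict ordering makes it wait behind c. Real schedulers use more elaborate policies (priorities, fair share, backfilling) to avoid leaving nodes idle like this.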
Examples:
- PBS Pro
- Platform LSF
- Flux
- Slurm (today, on CSD3)
A shell script with shell comments that are directives to the scheduler about how the job should be run.
sbatch job.sh
You will get back a Job ID.
squeue
squeue --me
If you don’t specify a name, by default the job’s output file will be called slurm-<JOBID>.out
To change this you can add an extra directive #SBATCH --output=
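To show how the pieces of a job script fit together, the sketch below builds one as text in Python. The `render_job_script` helper, the job name, the time limit and the output file name are all hypothetical. `--job-name`, `--time`, `--ntasks` and `--output` are standard `sbatch` options, and `%j` in a file name is replaced by the job ID. A real cluster will usually also need site-specific directives such as an account and a partition, which are omitted here.

```python
def render_job_script(job_name, time_limit, ntasks, output, command):
    """Return the text of a minimal Slurm batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        # To bash these are comments; sbatch reads them as directives.
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --output={output}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values throughout.
script = render_job_script(
    job_name="hello",
    time_limit="00:05:00",
    ntasks=1,
    output="hello-%j.out",
    command="echo Hello from $(hostname)",
)
print(script)
```

Saving the printed text as job.sh gives a file that can be passed to sbatch.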
sbatch
squeue --me
ls -lrt
cat
sleep 60
squeue --me
scancel <JOBID>
\(S = 1 / (1 - p + p/s)\), where…
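The formula above can be explored numerically. The sketch assumes the usual reading of Amdahl's law: p is the fraction of the runtime that can be parallelised and s is the speedup achieved on that fraction (for example, the number of cores). The function name and the value p = 0.95 are chosen for illustration.

```python
def amdahl_speedup(p, s):
    """Overall speedup S = 1 / ((1 - p) + p / s)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# With 95% of the work parallelisable, more cores give diminishing returns.
for s in (2, 8, 64, 1024):
    print(f"s={s:5d}  S={amdahl_speedup(0.95, s):6.2f}")
```

However large s becomes, the speedup cannot exceed 1 / (1 - p), which is 20 for p = 0.95. The serial part of a code sets the ceiling.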
printf()
A debugger (gdb, lldb, Linaro DDT…)
Warning!
Premature Optimization Is the Root of All Evil
Donald Knuth (1974)
Advice:
For more information we can be reached at:
You can also contact the ICCS, make a resource allocation request, or visit us at the Summer School RSE Helpdesk.